Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for clipping reads that extend past their mate #761

Merged
merged 9 commits into from
Mar 1, 2022

Conversation

nh13
Copy link
Member

@nh13 nh13 commented Jan 14, 2022

Summary

This adds:

  1. methods in SamRecordClipper to clip bases that map beyond the end of the mate for FR pairs.
  2. Use these methods in ClipBam to implement the option --clip-bases-past-mate.
  3. Use these method in consensus calling to properly decide the number of bases to clip based on insert size for FR pairs. This fixes a bug in the consensus caller that could improperly trim small inserts.

More details on the bug in the consensus caller that could improperly trim small inserts

If the reads are longer than the insert size, but aligned with
insertions, the consensus read may have the insertion trimmed.

See issue #759. Corrected output to include the full length read:

@HD	VN:1.6	SO:queryname	GO:none
@RG	ID:A	LB:lib1	SM:A	PL:ILLUMINA	PU:unit1
@CO	Read group A contains consensus reads generated from 1 input read groups.
lib1:8193187/A	77	*	0	0	*	*	0	0	ACCATTCTAATTACAGTGGGGGAAGTGTGCGCGCAAGTGTGTGCGTGTGTGCGCACGCGCAGGTGTGTGCGTGAGTGAAGGGATGTAGACAGAGGCTGCTGGCCGGG	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	cD:i:1	cE:f:0.0	RG:Z:A	MI:Z:8193187/A	cM:i:1	RX:Z:TTA-TCT	cd:B:s,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1	ce:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
lib1:8193187/A	141	*	0	0	*	*	0	0	CCCGGCCAGCAGCCTCTGTCTACATCCCTTCACTCACGCACACACCTGCGCGTGCGCACACACGCACACACTTGCGCGCACACTTCCCCCACTGTAATTAGAATGGT	CCCCCCCCCCCCCCCCCCCCC9CCC9CCC9CCCCCCCCC9CCCCCCCCCCCCCCCCCCCCCCCCCCCCC99CCCCCCC9CCCCCCCCCCCCCCCC9CCCCCCCCCCC	cD:i:1	cE:f:0.0	RG:Z:A	MI:Z:8193187/A	cM:i:1	RX:Z:TTA-TCT	cd:B:s,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1	ce:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

@nh13 nh13 requested a review from tfenne January 14, 2022 06:13
// Adjust the insert size based on clipping on the 5' end of the given read.
// This *does not* consider any clipping on the 5' end of its mate.
val fivePrimeClipping = cigar.takeWhile(_.operator.isClipping).filter(_.operator == CigarOperator.S).sumBy(_.length)
val adjustedInsertSize = abs(rec.insertSize) + fivePrimeClipping
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr: using the insert size does not account for inserted bases!

@@ -532,6 +532,20 @@ class VanillaUmiConsensusCallerTest extends UnitSpec with OptionValues {
s2.cigar.toString shouldBe "46S96M"
}

it should "not to trim based on insert size in the presence of soft-clipping (+/-)" in {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just swapping R1 and R2 from the previous test

@@ -560,4 +574,18 @@ class VanillaUmiConsensusCallerTest extends UnitSpec with OptionValues {
source.cigar.toString() shouldBe "3M"
source.sam shouldBe Some(rec)
}

it should "not trim reads with an insert size shorter than the read length when there's an insertion in the middle" in {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the new test to confirm the new behavior. Previously, the insert size would be calculated as 80bp, so the consensus read would have been 80bp. This happened because the insert size doesn't count inserted bases in the insert. The correct length of the consensus bases is 100bp.

// 3' end. If not, then we can use the full read. This *does not* consider any clipping on the 5' end of its mate.
val adjustedReadLength = {
if (rec.positiveStrand) {
val mateEnd = rec.mateEnd.getOrElse(rec.start + abs(rec.insertSize) - 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MC isn't present, this relies on the aligner setting insert size correctly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use CoordMath.getEnd here?

@codecov-commenter
Copy link

codecov-commenter commented Jan 14, 2022

Codecov Report

Merging #761 (51b839d) into master (7512807) will decrease coverage by 0.02%.
The diff coverage is 95.06%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #761      +/-   ##
==========================================
- Coverage   95.47%   95.45%   -0.03%     
==========================================
  Files         121      121              
  Lines        6855     6907      +52     
  Branches      463      479      +16     
==========================================
+ Hits         6545     6593      +48     
- Misses        310      314       +4     
Flag Coverage Δ
unittests 95.45% <95.06%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...c/main/scala/com/fulcrumgenomics/bam/ClipBam.scala 96.77% <81.81%> (-2.19%) ⬇️
...ala/com/fulcrumgenomics/bam/SamRecordClipper.scala 97.54% <93.75%> (-0.96%) ⬇️
...cala/com/fulcrumgenomics/alignment/Alignment.scala 100.00% <100.00%> (ø)
...a/com/fulcrumgenomics/umi/UmiConsensusCaller.scala 93.98% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7512807...51b839d. Read the comment docs.

Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might need to be walked through this PR. I understand the bug/goal, but have questions about the implementation.

I don't understand why the else branches are the way they are. It seems to be that they will now always remove 3' soft-clipping and I don't think that's desirable. I think there are three conceptual cases we need to walk through and make sure we agree on the expected outcome:

  1. A read in an FR pair is aligned such that it has aligned bases that extend past the "far" end of it's mate. This is simple, we know we want to trim any bases past the mate's far end.
  2. A read in an FR pair is aligned such that it doesn't have aligned bases that extend past the "far" end of it's mate, but is close and has soft-clipping that if converted to an M cigar, would put it past it's mate end. I'm not sure what to do in this case - the old code would have trimmed, but I'm not entirely sure that's the right thing to do.
  3. Reads are far apart (e.g. 2x150 with 350bp insert size) but have soft-clipping at the 3' end. In this case I believe we do not want to remove the clipping.

// 3' end. If not, then we can use the full read. This *does not* consider any clipping on the 5' end of its mate.
val adjustedReadLength = {
if (rec.positiveStrand) {
val mateEnd = rec.mateEnd.getOrElse(rec.start + abs(rec.insertSize) - 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use CoordMath.getEnd here?

rec.readPosAtRefPos(pos=mateEnd, returnLastBaseIfDeleted=false)
} else {
// Find where the end of the read goes past the mate start, if at all
val refPosAtReadEnd = min(mateEnd, rec.end) // last 3' base
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? We're in an else under rec.end >= mateEnd therefore rec.end will always be <= mateEnd and therefore the value returned. Am I missing something?

} else {
// Find where the end of the read goes past the mate start, if at all
val refPosAtReadEnd = min(mateEnd, rec.end) // last 3' base
rec.readPosAtRefPos(pos=refPosAtReadEnd, returnLastBaseIfDeleted=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a while since I've been through this code, but this will remove any trailing clipping, insertions, and maybe deletions too. Is that intentional? I.e. if the cigar is 80M20S, say because the read runs off into a large insertion, this will return 80, and cause the read to get trimmed to 80 bases, no?

rec.readPosAtRefPos(pos=refPosAtReadEnd, returnLastBaseIfDeleted=false)
}
} else {
// Find where the start of the read that goes before the mate start, if at all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Find where the start of the read that goes before the mate start, if at all
// Find where the start of the read goes before the mate start, if at all

Just saying out loud that I find these comments hard to grok because they use start/end as then are in the SAM file, as opposed to e.g. if you're looking at reads in IGV (where directionality/strand is conveyed) I think we tend to flip start/end for reverse reads to match sequencing order.

nh13 added 3 commits February 25, 2022 10:27
If the reads are longer than the insert size, but aligned with
insertions, the consensus read may have the insertion trimmed.
Adds methods to SamRecordClipper to clip bases that map beyond the end
of the mate for FR pairs.

Use these methods in ClipBam to enable the option `--clip-bases-past-mate`.

Use these method in consensus calling to properly decide the number of
bases to clip based on insert size for FR pairs.
@nh13 nh13 force-pushed the bugfix/consensus-caller-trimming branch from 388ba43 to 13d32a8 Compare February 25, 2022 23:16
@nh13 nh13 changed the title Bug fix: Consensus caller could improperly trim small inserts Support for clipping reads that extend past their mate Feb 25, 2022
@@ -187,7 +188,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean)
}

/**
* Attempts to clip an additional numberOfBasesToClip from the 5' end of the read. For
* Attempts to clip an additional numberOfBasesToClip from the 3' end of the read. For
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc fix, unrelated but found while reading the code

@@ -233,7 +234,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean)
/**
* Ensures that there are at least clipLength bases clipped at the 5' end
* of the read, _including_ any existing soft and hard clipping. Calculates any
* additional clipping and delegates to [com.fulcrumgenomics.bam.SamRecordClipper.[clipStartOfAlignment]].
* additional clipping and delegates to [[com.fulcrumgenomics.bam.SamRecordClipper.clipStartOfAlignment]].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc fix

@@ -162,7 +163,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean)
val oldElems = rec.cigar.elems.reverse
val oldBases = rec.bases.reverse
val oldQuals = rec.quals.reverse
val (newElems, readBasesClipped, refBasesClipped, bases, quals) = clip(oldElems, numberOfBasesToClip, oldBases, oldQuals)
val (newElems, readBasesClipped, _, bases, quals) = clip(oldElems, numberOfBasesToClip, oldBases, oldQuals)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intellij warning

@@ -289,6 +290,96 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean)
}
}

/** Clips overlapping read pairs, where both ends of the read pair are mapped to the same chromosome, and in FR orientation.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved from ClipBam

*
* @param rec the read
* @param mate the mate
* @param doNotClip do not clip the reads, only return the number of bases that would be clipped
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be convinced to update all the methods to have a mode where they don't actually clip, but just return how many additional bases they would clip.

@@ -440,7 +531,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean)
unreachable(s"Cigar $es doesn't contain soft clipping at the start")
}

elems = newElems.result
elems = newElems.result()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scala warning

@@ -168,6 +170,9 @@ trait UmiConsensusCaller[ConsensusRead <: SimpleRead] {
/** A consensus caller used to generate consensus UMI sequences */
private val consensusBuilder = new SimpleConsensusCaller()

/** Clipper utility used to _calculate_ clipping, but not do the actual clipping */
private val clipper = new SamRecordClipper(mode=ClippingMode.Soft, autoClipAttributes=true)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to have mode that is like "dry run", but wouldn't want that on the command line I don't think (e.g. in ClipBam). Not sure how to prevent that.

min(adjustedInsertSize, trimToLength) - 1
// Get the number of mapped bases to clip that maps beyond the mate's end. Soft-clipped bases are not counted.
val numBasesExtendingPasteMateEnd = this.clipper.clipExtendingPastMateEnd(rec=rec, doNotClip=true)
if (numBasesExtendingPasteMateEnd == 0) trimToLength - 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this matches the previous behavior. We don't trim if only soft-clipped bases are "beyond the end of the mate"

@@ -281,3 +281,23 @@ class AlignmentTest extends UnitSpec {
alignment.subByTarget(6, 11).cigar.toString() shouldBe "5D1="
}
}

class CigarStatsTest extends UnitSpec {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we didn't have tests for these, tsk tsk

@@ -524,4 +524,16 @@ class ClipBamTest extends UnitSpec with ErrorLogLevel with OptionValues {
}
}
}

it should "clip FR reads that extend past the mate" in {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SamRecordClipper has nice tests, so just wrote the minimal test here to ensure it works in ClipBam.

@@ -211,7 +211,6 @@ class AnnotateBamWithUmisTest extends UnitSpec {
Io.writeLines(reverseFq, Io.readLines(fq).grouped(4).toSeq.reverse.flatten)
val annotator = new AnnotateBamWithUmis(input=sam, fastq=Seq(fq, reverseFq), readStructure=Seq(ReadStructure("2M4B+M"), ReadStructure("1B+M")), output=out, attribute=umiTag, sorted=true)
val result = intercept[Exception] { annotator.execute() }
println(result)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's one, set for stun

s2.cigar.toString shouldBe "46S96M"
}

it should "not trim based on insert size in the presence of soft-clipping (+/-)" in {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests mirrors the previous one, but swaps R1 and R2

Comment on lines 81 to 82
@arg( doc="Clip overlapping reads.", mutex=Array("clipBasesPastMate")) val clipOverlappingReads: Boolean = false,
@arg( doc="Clip reads who sequence past the start of their mate.", mutex=Array("clipOverlappingReads")) val clipBasesPastMate: Boolean = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it's awkward to make these mutex. Yes, one implies the other, but they don't contradict each other. E.g. I could totally see a pipeline that wants to always set --clip-bases-past-mate and optionally set --clip-overlapping-reads, and now it'll have to know to unset the first option based on the second, when it's not strictly necessary.

Copy link
Member Author

@nh13 nh13 Feb 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the mutex, since one is a subset of the other. If this was written from scratch, I'd prefer an enum that was "None", "PastMate", "Overlapping" for a single argument, but here we are. Perhaps we can deprecate clipOverlappingReads and make an Option[PairClipping] type, where PairClipping is PastMate or Overlapping. Does that work for you?

src/main/scala/com/fulcrumgenomics/bam/ClipBam.scala Outdated Show resolved Hide resolved
rec.cigar.toString shouldBe "100M"
mate.start shouldBe 1
mate.cigar.toString shouldBe "100M"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want a test with insertions in the cigar strings since that was the original bug?
Also would be nice to have tests with asymmetry - i.e. one read does extend past it's mate but the other doesn't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Copied bug test case from here to this file.
  2. See the test case "clip reads that where only one end extends their mate's start". Added a second one with insertions

Copy link
Member Author

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne I think this fixes up all the issues, except those that I left unresolved. I am not 100% in love with the way we trim back overlapping reads with deletions, since in some cases (see added tests) selecting a mid point in the overlapping reference window can be within the deletion, so after trimming neither read spans the deletion. For example, two fully overlapping reads that are 50M10D50M, the mid point is right in the middle of the deletion, so we'll get 50M50S for one read and 50S50M for the other. I think it's more work than I am willing to do to try to keep the deletion in one of the two reads.

Comment on lines 402 to 410
val numBasesToClip = if (totalClippedBases == 0) 0 else {
// Bases past the end, so get the additional # of bases to clip
val existingClippedBases = {
val iter = if (rec.positiveStrand) rec.cigar.reverseIterator else rec.cigar.iterator
iter.takeWhile(_.operator.isClipping).filter(_.operator == Op.S).sumBy(_.length)
}
Math.max(0, totalClippedBases - existingClippedBases)
}
this.clip3PrimeEndOfAlignment(rec, numBasesToClip)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if the read is positive strand (the record isn't assumed to be positive!). Will condition on strand and use EndOfRead or StartOfRead depending.

@nh13 nh13 unassigned tfenne Feb 28, 2022
@nh13 nh13 self-assigned this Feb 28, 2022
@nh13 nh13 added the 2.0 label Feb 28, 2022
@tfenne tfenne assigned tfenne and nh13 and unassigned nh13 and tfenne Feb 28, 2022
@nh13 nh13 merged commit d69114f into main Mar 1, 2022
@nh13 nh13 deleted the bugfix/consensus-caller-trimming branch March 1, 2022 04:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants